The material in this assignment is based on Chapter 2 of Hands-On Machine Learning with Scikit-Learn and TensorFlow, by Aurelien Geron.
In our case, we will be trying to build a model of California housing prices using Census data. The primary goal: be able to predict the median housing price in any California district, using the data available in this dataset. This problem is an example of regression, where the prediction of our model (or its output) is a continuous variable. This is in contrast to classification, where the prediction of our model (or its output) is a class or group.
The data description can be found here: https://github.com/ageron/handson-ml2/tree/master/datasets/housing
The data file itself is in data/housing.csv, found in the ./data directory of this module.
The typical steps in such an analysis vary depending on the problem, but they usually include the following:
We will go through all of these steps. We won't dwell on the details of the model - we will use it like a black box. Later on in the course, we will spend more time on the details.
In the code blocks below, we sometimes give some hints on how to start. In other cases, we point you back to previous examples.
import pandas as pd
import plotly.io as pio
import matplotlib.pyplot as plt
import numpy as np
pio.renderers.default='notebook'
# Now let's print some data to the screen
housing = pd.read_csv('data/housing.csv')
print(housing.columns)
Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
'total_bedrooms', 'population', 'households', 'median_income',
'median_house_value', 'ocean_proximity'],
dtype='object')
As we did in chapter 1, we are going to want to explore the data.
print(housing.shape)
housing.head(5)
(20640, 10)
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
housing.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
desc = housing.describe()
desc
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| count | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 | 20433.000000 | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 |
| mean | -119.569704 | 35.631861 | 28.639486 | 2635.763081 | 537.870553 | 1425.476744 | 499.539680 | 3.870671 | 206855.816909 |
| std | 2.003532 | 2.135952 | 12.585558 | 2181.615252 | 421.385070 | 1132.462122 | 382.329753 | 1.899822 | 115395.615874 |
| min | -124.350000 | 32.540000 | 1.000000 | 2.000000 | 1.000000 | 3.000000 | 1.000000 | 0.499900 | 14999.000000 |
| 25% | -121.800000 | 33.930000 | 18.000000 | 1447.750000 | 296.000000 | 787.000000 | 280.000000 | 2.563400 | 119600.000000 |
| 50% | -118.490000 | 34.260000 | 29.000000 | 2127.000000 | 435.000000 | 1166.000000 | 409.000000 | 3.534800 | 179700.000000 |
| 75% | -118.010000 | 37.710000 | 37.000000 | 3148.000000 | 647.000000 | 1725.000000 | 605.000000 | 4.743250 | 264725.000000 |
| max | -114.310000 | 41.950000 | 52.000000 | 39320.000000 | 6445.000000 | 35682.000000 | 6082.000000 | 15.000100 | 500001.000000 |
corr = housing.corr()
corr.style.background_gradient().set_precision(3)
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| longitude | 1.000 | -0.925 | -0.108 | 0.045 | 0.070 | 0.100 | 0.055 | -0.015 | -0.046 |
| latitude | -0.925 | 1.000 | 0.011 | -0.036 | -0.067 | -0.109 | -0.071 | -0.080 | -0.144 |
| housing_median_age | -0.108 | 0.011 | 1.000 | -0.361 | -0.320 | -0.296 | -0.303 | -0.119 | 0.106 |
| total_rooms | 0.045 | -0.036 | -0.361 | 1.000 | 0.930 | 0.857 | 0.918 | 0.198 | 0.134 |
| total_bedrooms | 0.070 | -0.067 | -0.320 | 0.930 | 1.000 | 0.878 | 0.980 | -0.008 | 0.050 |
| population | 0.100 | -0.109 | -0.296 | 0.857 | 0.878 | 1.000 | 0.907 | 0.005 | -0.025 |
| households | 0.055 | -0.071 | -0.303 | 0.918 | 0.980 | 0.907 | 1.000 | 0.013 | 0.066 |
| median_income | -0.015 | -0.080 | -0.119 | 0.198 | -0.008 | 0.005 | 0.013 | 1.000 | 0.688 |
| median_house_value | -0.046 | -0.144 | 0.106 | 0.134 | 0.050 | -0.025 | 0.066 | 0.688 | 1.000 |
housing.columns[1]
'latitude'
Can you do this using a simple for loop?
#print(f'y = {y:.2f}')
#my_fig = plt.figure()
#my_ax = my_fig.add_subplot(1,1,1) # nrows=1, ncols=1, first plot
#my_ax.plot(t_pts, x_pts, color='blue', linestyle='--', label='sine')
#my_ax.set_xlabel('t')
#my_ax.set_ylabel(r'$\sin(t)$') # the $'s trigger LaTeX; the r prefix keeps the backslash raw
#my_ax.set_title('Sine wave')
for i in range(len(housing.columns)):
    fig, ax = plt.subplots()
    ax.hist(housing.iloc[:, i])
    ax.set_title(f'{housing.columns[i]}')
    ax.set_ylabel('Number of elements')
    plt.show()
Our goal in this assignment: predict the median_house_value given all of the other data. "median_house_value" will be our label. All of the other columns are our features.
Scatter plot median_house_value vs median_income. We would expect this to be highly correlated.
fig, ax = plt.subplots()
ax.scatter(housing['median_house_value'],housing['median_income'], marker='.')
ax.set_title('Median House Value versus Median Income in California')
ax.set_xlabel('Median House Value')
ax.set_ylabel('Median Income')
plt.show()
Feature engineering refers to combining existing features to form new ones. These combinations might be simple (like the result of adding/subtracting/multiplying/etc.) or they could be more complex - like the results of a sophisticated analysis. The basic idea is to add information for each candidate data point, which will hopefully improve whatever model we end up using to perform our predictions.
In our case there are some obvious new features we can create.
Below we show how to make the first new feature. You should add the other 3.
#
# The first one is done for you
housing['rooms_per_household'] = housing['total_rooms']/housing['households']
#
# Now do the other 3
housing['bedrooms_per_household'] = housing['total_bedrooms']/housing['households']
housing['bedrooms_per_room'] = housing['total_bedrooms']/housing['total_rooms']
housing['people_per_household'] = housing['population']/housing['households']
Think about the stratified sampling that we did earlier, and note that by far the most correlated variable in our dataset is median_income. So when we split our data, we want to be sure that the distribution of median income in our test sample is close to that of our train sample. Will this be true if we just randomly split the data? We already know that the answer is "not quite".
To test this, let's make a categorical variable called income_cat which describes median income.
We will have 5 categories, running from 1.0 (low) to 5.0 (high):
To see how to do this, refer to the previous example from the section titled "An example of the power of pandas", from the workbook "module0_intro/module0_2_more_python_and_ploty.ipynb". In that case, we divided the Power Plant dataset into labels High, Medium, and Low exhaust vacuum. Here we will use labels 1 through 5, defined as above.
Remember to insert this column into our housing dataframe. Use the column name "income_cat" for this.
# Use the pandas cut method to define 5 regions
cat_labels = ['0 - 1.5','1.5 - 3.0','3.0 - 4.5','4.5 - 6.0','6.0 - 100.0']
income_cat = pd.cut(housing.median_income,bins=[0.0,1.5,3.0,4.5,6.0,100.0],labels=cat_labels)
# Insert the result of the above as a new column into our dataframe
housing.insert(8,'income_cat',income_cat)
housing.head()
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | income_cat | median_house_value | ocean_proximity | rooms_per_household | bedrooms_per_household | bedrooms_per_room | people_per_household |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 6.0 - 100.0 | 452600.0 | NEAR BAY | 6.984127 | 1.023810 | 0.146591 | 2.555556 |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 6.0 - 100.0 | 358500.0 | NEAR BAY | 6.238137 | 0.971880 | 0.155797 | 2.109842 |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 6.0 - 100.0 | 352100.0 | NEAR BAY | 8.288136 | 1.073446 | 0.129516 | 2.802260 |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 4.5 - 6.0 | 341300.0 | NEAR BAY | 5.817352 | 1.073059 | 0.184458 | 2.547945 |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 3.0 - 4.5 | 342200.0 | NEAR BAY | 6.281853 | 1.081081 | 0.172096 | 2.181467 |
import plotly.express as px
housing.income_cat = pd.Categorical(housing.income_cat, categories=cat_labels,ordered=True)
fig = px.histogram(housing,x ='income_cat',barmode="overlay",histnorm='probability')
fig.show()
You could try:
Pick one! The simplest is to drop all rows with any missing data.
print(housing.shape)
housing = housing.dropna()
print(housing.shape)
(20640, 15)
(20433, 15)
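Dropping rows is not the only option: filling the holes (say, with the column median) keeps every row. A sketch of both on a toy frame (the column names just mirror the housing data; the values are made up):

```python
import pandas as pd

# Toy stand-in for housing: total_bedrooms has a missing value
df = pd.DataFrame({"total_bedrooms": [129.0, None, 190.0, 235.0],
                   "households": [126.0, 1138.0, 177.0, 219.0]})

# Option 1: drop any row with a missing value (what we did above)
dropped = df.dropna()

# Option 2: fill the holes with the column median instead, keeping all rows
median = df["total_bedrooms"].median()
filled = df.fillna({"total_bedrooms": median})
```

Here `dropped` has 3 rows, while `filled` keeps all 4 with the missing bedroom count replaced by the median (190.0).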
Our goal is to design an algorithm to predict housing prices. To test our model, we will want to split our data into two parts: a training set and a test set.
Use a split of 80% train and 20% test, and do stratified sampling based on the income category variable "income_cat" we made above.
from sklearn.model_selection import train_test_split
train_housing,test_housing = train_test_split(housing, test_size=0.2, random_state=25,stratify=housing['income_cat'])
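To see what `stratify` buys us, here is a quick sanity check on a hypothetical toy frame (the 70/30 category split below is invented for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame: 100 rows with a deliberate 70/30 category imbalance
toy = pd.DataFrame({"x": range(100),
                    "cat": ["low"] * 70 + ["high"] * 30})

train, test = train_test_split(toy, test_size=0.2, random_state=25,
                               stratify=toy["cat"])

# With stratify, both pieces preserve the 70/30 category proportions
train_frac_low = (train["cat"] == "low").mean()
test_frac_low = (test["cat"] == "low").mean()
```

Both fractions come out at exactly 0.7 here; a purely random split would only match the proportions approximately.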
We will use feature scaling as we did with the flight dataset. In this case, use MinMaxScaler. Remember: you need to use the training set to fit the transformer, and you need to use the transformer on both the training and test sets.
An example of how to do this for multiple columns is in DataSetPrep in the section Min-Max scaling and sci-kit learn estimators.
Remember that we do not use these techniques for categorical or object columns (something different will be done).
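A minimal sketch of the fit-on-train-only rule, using made-up numbers: the scaler learns the min and max from the training column alone, so transformed test values can legitimately land outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[1.0], [3.0], [5.0]])   # toy column: min=1, max=5
test = np.array([[0.0], [6.0]])           # values outside the training range

scaler = MinMaxScaler()
scaler.fit(train)                         # learn min/max from the TRAINING data only
train_scaled = scaler.transform(train)    # (x - 1) / (5 - 1): 0.0, 0.5, 1.0
test_scaled = scaler.transform(test)      # test values fall outside [0, 1] here
```

This is why we never fit (or fit_transform) on the test set: doing so would leak the test distribution into the preprocessing.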
To figure out which columns are which, use the code below:
print("housing column types:")
print(housing.dtypes)
print(f'Housing column types:\n {housing.dtypes}')
Housing column types:
longitude                  float64
latitude                   float64
housing_median_age         float64
total_rooms                float64
total_bedrooms             float64
population                 float64
households                 float64
median_income              float64
income_cat                category
median_house_value         float64
ocean_proximity             object
rooms_per_household        float64
bedrooms_per_household     float64
bedrooms_per_room          float64
people_per_household       float64
dtype: object
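As an alternative to eyeballing `dtypes`, pandas can select columns by type directly; a small sketch on an invented frame:

```python
import pandas as pd

# Toy frame mixing numeric, categorical, and text columns
df = pd.DataFrame({"median_income": [8.3, 7.2],
                   "income_cat": pd.Categorical(["high", "high"]),
                   "ocean_proximity": ["NEAR BAY", "INLAND"]})

# Numeric columns are candidates for scaling; everything else is handled separately
numeric_cols = df.select_dtypes(include="number").columns.tolist()
other_cols = df.select_dtypes(exclude="number").columns.tolist()
```

On the real housing frame the same two calls would split off the float64 columns from income_cat and ocean_proximity.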
I ended up using the following columns as input to my MinMaxScaler:
["housing_median_age","total_rooms","total_rooms","total_bedrooms", "population","households","median_income", "rooms_per_household","bedrooms_per_household","bedrooms_per_room", "people_per_household"]
from sklearn.preprocessing import MinMaxScaler

minmax_cols = ["housing_median_age", "total_rooms", "total_rooms", "total_bedrooms",
               "population", "households", "median_income",
               "rooms_per_household", "bedrooms_per_household", "bedrooms_per_room",
               "people_per_household"]

scaler = MinMaxScaler()
# Fit the scaler on the TRAINING data only...
scaler.fit(train_housing[minmax_cols])
# ...then transform both the training and the test data with it
train_scaled_minmax = scaler.transform(train_housing[minmax_cols])
test_scaled_minmax = scaler.transform(test_housing[minmax_cols])
The 'ocean_proximity' variable is a text variable that we will want to one-hot encode. Look at "Dealing with Text Features" in DataSetPrep.
from sklearn.preprocessing import OneHotEncoder
onehot_encoder = OneHotEncoder(sparse=False)  # on scikit-learn >= 1.2 this argument is named sparse_output
#
# Fit, then transform, with one-hot encoding
columnToEncode = 'ocean_proximity'
onehot_encoder.fit(train_housing[[columnToEncode]])
train_housing_ocean_one_hot = onehot_encoder.transform(train_housing[[columnToEncode]])
print("Transformed training data",type(train_housing_ocean_one_hot))
for i in range(10):
    print(train_housing.iloc[i][columnToEncode], train_housing_ocean_one_hot[i])
test_housing_ocean_one_hot = onehot_encoder.transform(test_housing[[columnToEncode]])
#
# Note: do NOT call fit_transform on the test set - only transform with the already-fit encoder
print("Transformed testing data",type(test_housing_ocean_one_hot))
for i in range(10):
    print(test_housing.iloc[i][columnToEncode], test_housing_ocean_one_hot[i])
Transformed training data <class 'numpy.ndarray'>
<1H OCEAN [1. 0. 0. 0. 0.]
<1H OCEAN [1. 0. 0. 0. 0.]
NEAR BAY [0. 0. 0. 1. 0.]
NEAR BAY [0. 0. 0. 1. 0.]
INLAND [0. 1. 0. 0. 0.]
<1H OCEAN [1. 0. 0. 0. 0.]
NEAR OCEAN [0. 0. 0. 0. 1.]
INLAND [0. 1. 0. 0. 0.]
INLAND [0. 1. 0. 0. 0.]
INLAND [0. 1. 0. 0. 0.]
Transformed testing data <class 'numpy.ndarray'>
<1H OCEAN [1. 0. 0. 0. 0.]
<1H OCEAN [1. 0. 0. 0. 0.]
<1H OCEAN [1. 0. 0. 0. 0.]
INLAND [0. 1. 0. 0. 0.]
INLAND [0. 1. 0. 0. 0.]
<1H OCEAN [1. 0. 0. 0. 0.]
<1H OCEAN [1. 0. 0. 0. 0.]
NEAR OCEAN [0. 0. 0. 0. 1.]
<1H OCEAN [1. 0. 0. 0. 0.]
NEAR OCEAN [0. 0. 0. 0. 1.]
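If you want to know which one-hot column corresponds to which category, the fitted encoder's `categories_` attribute records the (alphabetically sorted) column order; a toy sketch with invented data:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy data with two categories
data = np.array([["NEAR BAY"], ["INLAND"], ["NEAR BAY"]])

enc = OneHotEncoder()                    # returns a sparse matrix by default
onehot = enc.fit_transform(data).toarray()

# categories_[0] gives the column order of the one-hot output for the first feature
cats = list(enc.categories_[0])
```

So the output above reads: column 0 is <1H OCEAN, column 1 is INLAND, and so on, in sorted order.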
Refer to the earlier workbook titled "Putting Humpty-Dumpty back together!"
After all of our above work, we should have four feature arrays: the min-max scaled numeric features and the one-hot encoded ocean_proximity features, for both the train and test sets.
We need to combine these so we have one training numpy array and one testing numpy array. Along with each of these, we will have label arrays, made from the median_house_value column of the test and train samples.
print(train_scaled_minmax.shape)
print(train_housing_ocean_one_hot.shape)
print(test_scaled_minmax.shape)
print(test_housing_ocean_one_hot.shape)
(16346, 11)
(16346, 5)
(4087, 11)
(4087, 5)
import numpy as np
# Numpy arrays for the labels:
train_housing_labels = train_housing['median_house_value'].copy().values
test_housing_labels = test_housing['median_house_value'].copy().values
#
# Numpy arrays for the features: we need to concatenate the scaled minmax data and the one-hot data
train_housing_toFit = np.concatenate([train_scaled_minmax,train_housing_ocean_one_hot], axis=1)
test_housing_toFit = np.concatenate([test_scaled_minmax,test_housing_ocean_one_hot], axis=1)
print(train_housing_toFit.shape)
print(test_housing_toFit.shape)
(16346, 16)
(4087, 16)
As before, the fit model will be linear regression (we are using more than just a single feature, but it is still just linear regression). Test the fit both with RMSE and by plotting the difference (predicted - true label) vs the true label - but only use the test data.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
model = LinearRegression()
X = train_housing_toFit
y = train_housing_labels
#
# Fit the training data
model.fit(X,y)
#
# Use model.predict for both training and test data (don't fit the test data!)
print("Fit results: slope=",model.coef_," and intercept=",model.intercept_)
print()
print('Train Set Info')
train_housing_labels_pred = model.predict(X) # This puts an array of predictions for each X in ypred
print("Type returned from predict:",type(train_housing_labels_pred),"; shape: ",train_housing_labels_pred.shape)
train_housing_labels_pred = train_housing_labels_pred.reshape(len(train_housing_labels_pred))
print("Predictions",train_housing_labels_pred,train_housing.median_house_value.values)
#
train_lin_mse = mean_squared_error(y,train_housing_labels_pred)
train_lin_rmse = np.sqrt(train_lin_mse) ## Remember to take the square root!
print("Mean squared error and the root mean square",train_lin_mse,train_lin_rmse)
print()
print('Test Set Info')
test_X = test_housing_toFit
test_y = test_housing_labels
test_housing_labels_pred = model.predict(test_X) # This puts an array of predictions for each X in ypred
print("Type returned from predict:",type(test_housing_labels_pred),"; shape: ",test_housing_labels_pred.shape)
test_housing_labels_pred = test_housing_labels_pred.reshape(len(test_housing_labels_pred))
print("Predictions",test_housing_labels_pred,test_housing.median_house_value.values)
#
test_lin_mse = mean_squared_error(test_y,test_housing_labels_pred)
test_lin_rmse = np.sqrt(test_lin_mse) ## Remember to take the square root!
print("Mean squared error and the root mean square",test_lin_mse,test_lin_rmse)
Fit results: slope= [   61774.29957191     9532.89883646     9532.89883646   132730.26034668
 -1375944.39175411   548652.04756488   615298.7400346    886391.82634501
  -842936.0253269    316854.35030682    58775.28438283   -24840.97011624
   -88512.72063594   146091.08619833   -20650.6543009   -12086.74114525]  and intercept= 13294.906326699565

Train Set Info
Type returned from predict: <class 'numpy.ndarray'> ; shape:  (16346,)
Predictions [356613.10396553 277795.15226575 224855.1675851 ... 467126.87870633 331730.62329191 86234.55415563] [348000. 343000. 412500. ... 500001. 454800. 58200.]
Mean squared error and the root mean square 4758572153.845578 68982.40466847744

Test Set Info
Type returned from predict: <class 'numpy.ndarray'> ; shape:  (4087,)
Predictions [232142.36182361 128823.81987666 192505.5403689 ... 162384.19878088 455585.01327474 42995.52631529] [206700. 350000. 220000. ... 170900. 477100. 42700.]
Mean squared error and the root mean square 4818026101.760018 69412.0025770761
import plotly.express as px
# NumPy arrays arguments
delta = test_housing_labels-test_housing_labels_pred
fig = px.scatter(x=test_y,y=delta,labels={'x':'label', 'y':'delta'},marginal_y='histogram')
fig.show()
import plotly.express as px
# NumPy arrays arguments
delta = train_housing_labels-train_housing_labels_pred
fig = px.scatter(x=y,y=delta,labels={'x':'label', 'y':'delta'},marginal_y='histogram')
fig.show()
If you are looking for more to do, here are some extra tasks! Each counts as 1/3 of the total extra credit for this assignment.
We probably should have done this first.... but how do we know that our fit improved our knowledge? Is there a simple predictor that we could have used instead? How about if we predict the price simply based on the mean (or the median) of all housing prices? Use the mean squared error to do this.
Try another predictor from sklearn: RandomForestRegressor and/or DecisionTreeRegressor. Make sure you test the fit results (using mean_squared_error) on BOTH the training AND test sets!
Making maps: This data is interesting since it has latitude and longitude. Previously we made world maps, but that depended on our data having tags which were country names. This is different: it will be more like a scatter plot, but arranged on an existing map (primarily of California). How do we do this? Google: plotly map scatter. Take the code from the first example and modify it.
mean_of_training_data = np.mean(train_housing['median_house_value'])
# We use the mean of the training data, as that is what the model would be based on
shape = np.size(test_housing_labels)
means = np.empty(shape, dtype=int)  # use the builtin int: np.int was removed from numpy
means.fill(mean_of_training_data)
#this setup above is to make the prediction labels which would be the mean for everything
test_y = test_housing_labels
test_housing_labels_pred = means # This puts an array of predictions for each X in ypred
print("Type returned from predict:",type(test_housing_labels_pred),"; shape: ",test_housing_labels_pred.shape)
test_housing_labels_pred = test_housing_labels_pred.reshape(len(test_housing_labels_pred))
print("Predictions",test_housing_labels_pred,test_housing.median_house_value.values)
#
test_lin_mse = mean_squared_error(test_y,test_housing_labels_pred)
test_lin_rmse = np.sqrt(test_lin_mse) ## Remember to take the square root!
print("Mean squared error and the root mean square",test_lin_mse,test_lin_rmse)
Type returned from predict: <class 'numpy.ndarray'> ; shape:  (4087,)
Predictions [206880 206880 206880 ... 206880 206880 206880] [206700. 350000. 220000. ... 170900. 477100. 42700.]
Mean squared error and the root mean square 13076146440.742598 114350.9791857621
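For comparison, scikit-learn packages this same always-predict-the-mean baseline as `DummyRegressor`; a minimal sketch on invented numbers:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Toy data: the dummy ignores the features entirely
X_train = np.zeros((4, 1))
y_train = np.array([100.0, 200.0, 300.0, 400.0])   # training mean = 250
X_test = np.zeros((2, 1))
y_test = np.array([150.0, 350.0])

baseline = DummyRegressor(strategy="mean")  # always predicts the training mean
baseline.fit(X_train, y_train)
pred = baseline.predict(X_test)             # [250., 250.]
rmse = np.sqrt(mean_squared_error(y_test, pred))   # errors are -100 and +100, so rmse = 100
```

Any real model should beat this baseline's RMSE; ours comfortably does.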
import plotly.express as px
# NumPy arrays arguments
delta = test_housing_labels-test_housing_labels_pred
fig = px.scatter(x=test_y,y=delta,labels={'x':'label', 'y':'delta'},marginal_y='histogram')
fig.show()
Try another predictor from sklearn: RandomForestRegressor and/or DecisionTreeRegressor. Make sure you test the fit results (using mean_squared_error) on BOTH the training AND test sets!
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
X = train_housing_toFit
y = train_housing_labels
#
# Fit the training data
model.fit(X,y)
#
# Use model.predict for both training and test data (don't fit the test data!)
print()
print('Train Set Info')
train_housing_labels_pred = model.predict(X) # This puts an array of predictions for each X in ypred
print("Type returned from predict:",type(train_housing_labels_pred),"; shape: ",train_housing_labels_pred.shape)
train_housing_labels_pred = train_housing_labels_pred.reshape(len(train_housing_labels_pred))
print("Predictions",train_housing_labels_pred,train_housing.median_house_value.values)
#
train_lin_mse = mean_squared_error(y,train_housing_labels_pred)
train_lin_rmse = np.sqrt(train_lin_mse) ## Remember to take the square root!
print("Mean squared error and the root mean square",train_lin_mse,train_lin_rmse)
print()
print('Test Set Info')
test_X = test_housing_toFit
test_y = test_housing_labels
test_housing_labels_pred = model.predict(test_X) # This puts an array of predictions for each X in ypred
print("Type returned from predict:",type(test_housing_labels_pred),"; shape: ",test_housing_labels_pred.shape)
test_housing_labels_pred = test_housing_labels_pred.reshape(len(test_housing_labels_pred))
print("Predictions",test_housing_labels_pred,test_housing.median_house_value.values)
#
test_lin_mse = mean_squared_error(test_y,test_housing_labels_pred)
test_lin_rmse = np.sqrt(test_lin_mse) ## Remember to take the square root!
print("Mean squared error and the root mean square",test_lin_mse,test_lin_rmse)
Train Set Info
Type returned from predict: <class 'numpy.ndarray'> ; shape:  (16346,)
Predictions [364820.04 341333. 360486. ... 500001. 461567.33 65273. ] [348000. 343000. 412500. ... 500001. 454800. 58200.]
Mean squared error and the root mean square 485891800.8305717 22042.9535414511

Test Set Info
Type returned from predict: <class 'numpy.ndarray'> ; shape:  (4087,)
Predictions [178148. 107935. 193161. ... 163068. 487834.68 61454. ] [206700. 350000. 220000. ... 170900. 477100. 42700.]
Mean squared error and the root mean square 3392005192.644656 58240.923693264485
import plotly.express as px
# NumPy arrays arguments
delta = test_housing_labels-test_housing_labels_pred
fig = px.scatter(x=test_y,y=delta,labels={'x':'label', 'y':'delta'},marginal_y='histogram')
fig.show()
import plotly.express as px
# NumPy arrays arguments
delta = train_housing_labels-train_housing_labels_pred
fig = px.scatter(x=y,y=delta,labels={'x':'label', 'y':'delta'},marginal_y='histogram')
fig.show()
Making maps
This data is interesting since it has latitude and longitude. Previously we made world maps, but this depended on our data having tags which were country names. This is different. This will be more like a scatter-plot, but arranged on an existing map (primarily California). How do we do this?
Google: plotly map scatter
Take the code from the first example and modify it. Pick something interesting to use for the size of the scatter points.
import plotly.graph_objects as go
fig = go.Figure(data=go.Scattergeo(
    lon=housing['longitude'],
    lat=housing['latitude'],
    text='Median house value: ' + housing['median_house_value'].astype(str),
    marker_size=housing['bedrooms_per_household']
))
fig.update_geos(fitbounds="locations")
fig.update_layout(
title = 'Median house values of houses in California with marker size based on bedrooms per household',
)
fig.show()